✅ Identify primary keys and foreign keys that link datasets
✅ Understand the different types of joins (left, right, inner, full, semi, anti)
✅ Perform joins in R using dplyr (left_join(), etc.)
✅ Recognize and fix common merge problems (duplicates, mismatched names, missing values)
✅ Distinguish between merging (columns) vs appending (rows)
✅ Apply joins to a real-world dataset (Gapminder)
✅ Debug messy joins with practical tools (anti_join(), cleaning, deduplication)
Motivation: Real analysis rarely lives in one table. Joining is the bridge that lets us answer richer questions.
Why learn merging?
It’s rare that a data analysis task involves only a single data frame. Usually, you have multiple data sets and want to join them together to answer the question you are interested in.
For example, consider wanting to determine the relationship between GDP and the number of cows in a country.
It’s rare that a dataset would have both of these things in common.
However, there are separate datasets for these things:
Gapminder is a well-known dataset that tracks global development statistics over time.
It contains a variety of indicators for most countries in the world and years from 1952 through 2007.
Examples of indicators:
Life expectancy
GDP per capita
Population
Continent
Each row represents a country in a specific year and includes these variables.
Gapminder is popular for teaching data analysis because:
It includes real-world data covering health, wealth, and demographic trends.
It is tidy, clean, and easy to join with other datasets.
Used by Hans Rosling in famous visualizations to show trends in global health and income.
Merge Examples
Note: these examples were inspired by the work of Craig Hutton and his tutorial on joins:
# start by loading in the gapminder data:library(gapminder) # constains the gapminder data, will load it in as an objectlibrary(dplyr)head(gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
Create some different datasets to merge on:
# create a dataset of life expectancy for all non-Asian countries:life_df <- gapminder |>filter(continent !="Asia") |>select(country, year, lifeExp) |>ungroup()# create a dataset on population for all non-European countries:pop_df <- gapminder |>filter(continent !="Europe") |>select(country, year, pop) |>ungroup()# create a dataset on GDP for all countries:gdp_df <- gapminder |>select(country, year, gdpPercap) |>ungroup()
Left join:
If we wanted to add population data for each continent that appears in the life expectancy data frame, we could use the left_join():
# A tibble: 6 × 3
country year pop
<fct> <int> <int>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
What columns should we merge on?
country and year
Left join:
joined_table <- life_df |>left_join( pop_df, by =c("country", "year"))head(joined_table)
# A tibble: 6 × 4
country year lifeExp pop
<fct> <int> <dbl> <int>
1 Albania 1952 55.2 NA
2 Albania 1957 59.3 NA
3 Albania 1962 64.8 NA
4 Albania 1967 66.2 NA
5 Albania 1972 67.7 NA
6 Albania 1977 68.9 NA
If the key columns have different names, you can tell the join function which columns to use with the equality operator
# create a renamed version:life_df_renamed <-rename(life_df, nation = country)joined_table_renamed <- life_df_renamed |>left_join(pop_df, #since the country column is now called "nation" in life_df, #we have to tell left_join which columns to match on.#You'll get an error if you try by = c("continent", "year") this timeby =c("nation"="country","year"))head(joined_table_renamed)
# A tibble: 6 × 4
nation year lifeExp pop
<fct> <int> <dbl> <int>
1 Albania 1952 55.2 NA
2 Albania 1957 59.3 NA
3 Albania 1962 64.8 NA
4 Albania 1967 66.2 NA
5 Albania 1972 67.7 NA
6 Albania 1977 68.9 NA
Right join:
A right join is basically the same thing as a left_join but in the other direction, where the 1st data frame (x) is joined to the 2nd one (y), so if we wanted to add life expectancy and GDP per capita data we could either use:
a right_join() with life_df on the left side and gdp_df on the right side, or
a left_join() with gdp_df on the left side and life_df on the right side
… and get the same result with only the columns arranged differently…
# try via right join:rj <- life_df |>right_join(gdp_df, by =c('year', 'country'))# try via left join:lj <- gdp_df |>left_join(life_df, by =c('year', 'country'))# arrange and select them:rj <- rj |>arrange(country, year, gdpPercap, lifeExp) |>select(country, year, gdpPercap, lifeExp)lj <- lj |>arrange(country, year, gdpPercap, lifeExp) |>select(country, year, gdpPercap, lifeExp)# look at the datahead(rj)
# A tibble: 6 × 4
country year gdpPercap lifeExp
<fct> <int> <dbl> <dbl>
1 Afghanistan 1952 779. NA
2 Afghanistan 1957 821. NA
3 Afghanistan 1962 853. NA
4 Afghanistan 1967 836. NA
5 Afghanistan 1972 740. NA
6 Afghanistan 1977 786. NA
head(lj)
# A tibble: 6 × 4
country year gdpPercap lifeExp
<fct> <int> <dbl> <dbl>
1 Afghanistan 1952 779. NA
2 Afghanistan 1957 821. NA
3 Afghanistan 1962 853. NA
4 Afghanistan 1967 836. NA
5 Afghanistan 1972 740. NA
6 Afghanistan 1977 786. NA
# check to see if they are identcial:identical(lj, rj)
[1] TRUE
Full Join:
After aligning rows by matches in the key column(s), a full join retains all rows that appear in x or y
# join the two of themfull_join_df <- life_df |>full_join(gdp_df, by =c('country', 'year')) head(full_join_df)
# check the dimensions of the full join and compare them to the left join:dim(full_join_df)
[1] 1704 4
dim(life_df)
[1] 1308 3
Inner Join:
If you want to work with data that match entries in both data sources, you would use the inner_join():
# Mean life expectancy and population:life_expect_population <- life_df |>inner_join(pop_df, by =c('country', 'year'))# check the dimensions:dim(life_expect_population)
[1] 948 4
# check summary to see that there are no NAs:summary(life_expect_population)
country year lifeExp pop
Algeria : 12 Min. :1952 Min. :23.60 Min. : 60011
Angola : 12 1st Qu.:1966 1st Qu.:44.68 1st Qu.: 2053003
Argentina: 12 Median :1980 Median :52.60 Median : 5068804
Australia: 12 Mean :1980 Mean :54.51 Mean : 14506347
Benin : 12 3rd Qu.:1993 3rd Qu.:64.91 3rd Qu.: 12431196
Bolivia : 12 Max. :2007 Max. :81.23 Max. :301139947
(Other) :876
Anti-Join:
Anti joins return the rows x that do not appear in y:
# keep the rows in GDP that do not appear in the Population df:anti_join_df <- gdp_df |>anti_join(pop_df, by =c('country', 'year'))head(anti_join_df)
# try the other version:anti_join_df_reverse <- pop_df |>anti_join(gdp_df, by =c('country', 'year'))# should have no observations heredim(anti_join_df_reverse)
[1] 0 3
Anti joins can be very useful if you want to know which rows are excluded due to mismatches in the key columns.
Checking for consistencies and inconsistencies between data sources is an important part of the data cleaning process and can often help to uncover data entry or coding errors that should be fixed prior to conducting any analyses.
Binding Rows:
# imagine we have two datasets that we want to append:gapminder1900s <- gapminder |>filter(year <2000)gapminder2000s <- gapminder |>filter(year >=2000)# bind them together via rbind:combined_gapminder <- gapminder1900s |>rbind(gapminder2000s)# look at results head(combined_gapminder)
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
summary(combined_gapminder)
country continent year lifeExp
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
Algeria : 12 Asia :396 Median :1980 Median :60.71
Angola : 12 Europe :360 Mean :1980 Mean :59.47
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
Australia : 12 Max. :2007 Max. :82.60
(Other) :1632
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
Working through a messy exercise together:
library(dplyr)library(stringr)library(tibble)# Messy GDP datagdp_df <-tribble(~country, ~year, ~gdp_percap,"United States", 2007, 48000,"United States ", 2007, 48000, # extra spaces -> duplicate key"Brasil", 2007, 9000,"Viet Nam", 2007, 2300,"DRC", 2007, 300)# Messy life expectancy data (different key names + an extra “Atlantis” row)life_df <-tribble(~nation, ~yr, ~life_exp,"United States", 2007, 78.1,"Brazil", 2007, 72.4, # note spelling “Brazil”"Viet Nam", 2007, 74.3,"Democratic Republic of the Congo", 2007, 47.0,"Atlantis", 2007, 88.8)# A tiny crosswalk you’ll use to harmonize country names -> ISO3crosswalk <-tribble(~raw_name, ~iso3c,"United States", "USA","United States ", "USA","Brasil", "BRA","Brazil", "BRA","Viet Nam", "VNM","DRC", "COD","Democratic Republic of the Congo", "COD","Atlantis", "ATL")
Take some time to inspect keys and duplicates:
# Count potential keys in each data setgdp_dups <- gdp_df |>count(country, year, name ="n") |>filter(n >1)life_dups <- life_df |>count(nation, yr, name ="n") |>filter(n >1)gdp_dups
# A tibble: 0 × 3
# ℹ 3 variables: country <chr>, year <dbl>, n <int>
life_dups
# A tibble: 0 × 3
# ℹ 3 variables: nation <chr>, yr <dbl>, n <int>
Standardize keys:
Trim whitespace; map country names to iso3c via the crosswalk
# A tibble: 5 × 3
iso3c year gdp_percap
<chr> <dbl> <dbl>
1 USA 2007 48000
2 USA 2007 48000
3 BRA 2007 9000
4 VNM 2007 2300
5 COD 2007 300
life_clean <- life_df |>mutate(nation =str_squish(nation)) |>left_join(crosswalk, by =c("nation"="raw_name")) |>transmute(iso3c, year = yr, life_exp)head(life_clean)
# A tibble: 5 × 3
iso3c year life_exp
<chr> <dbl> <dbl>
1 USA 2007 78.1
2 BRA 2007 72.4
3 VNM 2007 74.3
4 COD 2007 47
5 ATL 2007 88.8
Ensure keys are unique before joining:
# If either of these returns rows, you’ve got many-to-many riskgdp_clean |>count(iso3c, year) |>filter(n >1)
# A tibble: 1 × 3
iso3c year n
<chr> <dbl> <int>
1 USA 2007 2
life_clean |>count(iso3c, year) |>filter(n >1)
# A tibble: 0 × 3
# ℹ 3 variables: iso3c <chr>, year <dbl>, n <int>
# Fix: for demo, deduplicate GDP by keeping one row per key (e.g., max gdp_percap)gdp_clean <- gdp_clean |>group_by(iso3c, year) |>summarize(gdp_percap =max(gdp_percap), .groups ="drop")
Join + check:
# Left join GDP <- LifeExp (keeps GDP universe)analytic <- gdp_clean |>left_join(life_clean, by =c("iso3c", "year"))# What didn’t match on the life side?unmatched_in_life <- gdp_clean |>anti_join(life_clean, by =c("iso3c","year"))# What countries appear only in life (e.g., “Atlantis”)?only_in_life <- life_clean |>anti_join(gdp_clean, by =c("iso3c","year"))list(analytic_preview =head(analytic),rows_analytic =nrow(analytic),unmatched_in_life = unmatched_in_life,only_in_life = only_in_life)
$analytic_preview
# A tibble: 4 × 4
iso3c year gdp_percap life_exp
<chr> <dbl> <dbl> <dbl>
1 BRA 2007 9000 72.4
2 COD 2007 300 47
3 USA 2007 48000 78.1
4 VNM 2007 2300 74.3
$rows_analytic
[1] 4
$unmatched_in_life
# A tibble: 0 × 3
# ℹ 3 variables: iso3c <chr>, year <dbl>, gdp_percap <dbl>
$only_in_life
# A tibble: 1 × 3
iso3c year life_exp
<chr> <dbl> <dbl>
1 ATL 2007 88.8
What We Learned Today
🔑 Keys: primary, foreign, compound
🧩 Types of joins and when to use them
🛠️ Syntax: by = c("key1", "key2")
📊 Practical joins with Gapminder data
🚦 Diagnostics: anti_join() to find mismatches, checking duplicates
➕ Appending vs. merging
🧪 Hands-on messy merge exercise
✅ You should now feel comfortable:
Performing merges in R
Auditing and debugging joins
Understanding how joins affect your dataset size/content